{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python38564bitvenvvenv9efd950bbdd2494f8072faf3b588558e",
   "display_name": "Python 3.8.5 64-bit ('.venv': venv)",
   "language": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "source": [
    "# Process the Unsplash dataset with CLIP\n",
    "\n",
    "This notebook processes all the downloaded photos using OpenAI's [CLIP neural network](https://github.com/openai/CLIP). For each image we get a feature vector containing 512 float numbers, which we will store in a file. These feature vectors will be used later to compare them to the text feature vectors.\n",
    "\n",
    "This step will be significantly faster if you have a GPU, but it will also work on the CPU."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
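  {
   "source": [
    "As a rough illustration of how these vectors are used later, the sketch below ranks a few image vectors against a query vector by cosine similarity. Everything in it is made up for illustration: the random arrays stand in for real CLIP features, and `text_features` is a hypothetical placeholder for an encoded text query. Because all vectors are normalized to unit length, the dot product equals the cosine similarity."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Toy stand-ins for real CLIP features (assumption: 5 photos, 512 dimensions)\n",
    "rng = np.random.default_rng(0)\n",
    "image_features = rng.normal(size=(5, 512)).astype(np.float32)\n",
    "text_features = rng.normal(size=(1, 512)).astype(np.float32)\n",
    "\n",
    "# Normalize every vector to unit length\n",
    "image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)\n",
    "text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)\n",
    "\n",
    "# For unit vectors the dot product is the cosine similarity\n",
    "similarities = (image_features @ text_features.T).squeeze(1)\n",
    "\n",
    "# Indices of the photos, best match first\n",
    "best_first = np.argsort(-similarities)\n",
    "print(best_first, similarities[best_first])"
   ]
  },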
  {
   "source": [
    "## Load the photos\n",
    "\n",
    "Load all photos from the folder they were stored."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "Photos found: 24996\n"
     ]
    }
   ],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "# Set the path to the photos\n",
    "dataset_version = \"lite\"  # Use \"lite\" or \"full\"\n",
    "photos_path = Path(\"unsplash-dataset\") / dataset_version / \"photos\"\n",
    "\n",
    "# List all JPGs in the folder\n",
    "photos_files = list(photos_path.glob(\"*.jpg\"))\n",
    "\n",
    "# Print some statistics\n",
    "print(f\"Photos found: {len(photos_files)}\")"
   ]
  },
  {
   "source": [
    "## Load the CLIP net\n",
    "\n",
    "Load the CLIP net and define the function that computes the feature vectors"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import clip\n",
    "import torch\n",
    "from PIL import Image\n",
    "\n",
    "# Load the open CLIP model\n",
    "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
    "model, preprocess = clip.load(\"ViT-B/32\", device=device)\n",
    "\n",
    "# Function that computes the feature vectors for a batch of images\n",
    "def compute_clip_features(photos_batch):\n",
    "    # Load all the photos from the files\n",
    "    photos = [Image.open(photo_file) for photo_file in photos_batch]\n",
    "    \n",
    "    # Preprocess all photos\n",
    "    photos_preprocessed = torch.stack([preprocess(photo) for photo in photos]).to(device)\n",
    "\n",
    "    with torch.no_grad():\n",
    "        # Encode the photos batch to compute the feature vectors and normalize them\n",
    "        photos_features = model.encode_image(photos_preprocessed)\n",
    "        photos_features /= photos_features.norm(dim=-1, keepdim=True)\n",
    "\n",
    "    # Transfer the feature vectors back to the CPU and convert to numpy\n",
    "    return photos_features.cpu().numpy()"
   ]
  },
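  {
   "source": [
    "The normalization step inside `compute_clip_features` can be checked in isolation. The sketch below applies the same operation to a small hand-written tensor (toy values, not CLIP output) and confirms that each row ends up with unit length."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "# Toy feature matrix: 2 vectors with 2 dimensions each (illustrative values)\n",
    "toy_features = torch.tensor([[3.0, 4.0], [0.0, 2.0]])\n",
    "\n",
    "# Same operation as in compute_clip_features: divide each row by its L2 norm\n",
    "toy_features /= toy_features.norm(dim=-1, keepdim=True)\n",
    "\n",
    "# Every row should now have a norm of (approximately) 1\n",
    "print(toy_features.norm(dim=-1))\n",
    "assert torch.allclose(toy_features.norm(dim=-1), torch.ones(2))"
   ]
  },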
  {
   "source": [
    "## Process all photos\n",
    "\n",
    "Now we need to compute the features for all photos. We will do that in batches, because it is much more efficient. You should tune the batch size so that it fits on your GPU. The processing on the GPU is fairly fast, so the bottleneck will probably be loading the photos from the disk.\n",
    "\n",
    "In this step the feature vectors and the photo IDs of each batch will be saved to a file separately. This makes the whole process more robust. We will merge the data later."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "atch 816/1563\n",
      "Processing batch 817/1563\n",
      "Processing batch 818/1563\n",
      "Processing batch 819/1563\n",
      "Processing batch 820/1563\n",
      "Processing batch 821/1563\n",
      "Processing batch 822/1563\n",
      "Processing batch 823/1563\n",
      "Processing batch 824/1563\n",
      "Processing batch 825/1563\n",
      "Processing batch 826/1563\n",
      "Processing batch 827/1563\n",
      "Processing batch 828/1563\n",
      "Processing batch 829/1563\n",
      "Processing batch 830/1563\n",
      "Processing batch 831/1563\n",
      "Processing batch 832/1563\n",
      "Processing batch 833/1563\n",
      "Processing batch 834/1563\n",
      "Processing batch 835/1563\n",
      "Processing batch 836/1563\n",
      "Processing batch 837/1563\n",
      "Processing batch 838/1563\n",
      "Processing batch 839/1563\n",
      "Processing batch 840/1563\n",
      "Processing batch 841/1563\n",
      "Processing batch 842/1563\n",
      "Processing batch 843/1563\n",
      "Processing batch 844/1563\n",
      "Processing batch 845/1563\n",
      "Processing batch 846/1563\n",
      "Processing batch 847/1563\n",
      "Processing batch 848/1563\n",
      "Processing batch 849/1563\n",
      "Processing batch 850/1563\n",
      "Processing batch 851/1563\n",
      "Processing batch 852/1563\n",
      "Processing batch 853/1563\n",
      "Processing batch 854/1563\n",
      "Processing batch 855/1563\n",
      "Processing batch 856/1563\n",
      "Processing batch 857/1563\n",
      "Processing batch 858/1563\n",
      "Processing batch 859/1563\n",
      "Processing batch 860/1563\n",
      "Processing batch 861/1563\n",
      "Processing batch 862/1563\n",
      "Processing batch 863/1563\n",
      "Processing batch 864/1563\n",
      "Processing batch 865/1563\n",
      "Processing batch 866/1563\n",
      "Processing batch 867/1563\n",
      "Processing batch 868/1563\n",
      "Processing batch 869/1563\n",
      "Processing batch 870/1563\n",
      "Processing batch 871/1563\n",
      "Processing batch 872/1563\n",
      "Processing batch 873/1563\n",
      "Processing batch 874/1563\n",
      "Processing batch 875/1563\n",
      "Processing batch 876/1563\n",
      "Processing batch 877/1563\n",
      "Processing batch 878/1563\n",
      "Processing batch 879/1563\n",
      "Processing batch 880/1563\n",
      "Processing batch 881/1563\n",
      "Processing batch 882/1563\n",
      "Processing batch 883/1563\n",
      "Processing batch 884/1563\n",
      "Processing batch 885/1563\n",
      "Processing batch 886/1563\n",
      "Processing batch 887/1563\n",
      "Processing batch 888/1563\n",
      "Processing batch 889/1563\n",
      "Processing batch 890/1563\n",
      "Processing batch 891/1563\n",
      "Processing batch 892/1563\n",
      "Processing batch 893/1563\n",
      "Processing batch 894/1563\n",
      "Processing batch 895/1563\n",
      "Processing batch 896/1563\n",
      "Processing batch 897/1563\n",
      "Processing batch 898/1563\n",
      "Processing batch 899/1563\n",
      "Processing batch 900/1563\n",
      "Processing batch 901/1563\n",
      "Processing batch 902/1563\n",
      "Processing batch 903/1563\n",
      "Processing batch 904/1563\n",
      "Processing batch 905/1563\n",
      "Processing batch 906/1563\n",
      "Processing batch 907/1563\n",
      "Processing batch 908/1563\n",
      "Processing batch 909/1563\n",
      "Processing batch 910/1563\n",
      "Processing batch 911/1563\n",
      "Processing batch 912/1563\n",
      "Processing batch 913/1563\n",
      "Processing batch 914/1563\n",
      "Processing batch 915/1563\n",
      "Processing batch 916/1563\n",
      "Processing batch 917/1563\n",
      "Processing batch 918/1563\n",
      "Processing batch 919/1563\n",
      "Processing batch 920/1563\n",
      "Processing batch 921/1563\n",
      "Processing batch 922/1563\n",
      "Processing batch 923/1563\n",
      "Processing batch 924/1563\n",
      "Processing batch 925/1563\n",
      "Processing batch 926/1563\n",
      "Processing batch 927/1563\n",
      "Processing batch 928/1563\n",
      "Processing batch 929/1563\n",
      "Processing batch 930/1563\n",
      "Processing batch 931/1563\n",
      "Processing batch 932/1563\n",
      "Processing batch 933/1563\n",
      "Processing batch 934/1563\n",
      "Processing batch 935/1563\n",
      "Processing batch 936/1563\n",
      "Processing batch 937/1563\n",
      "Processing batch 938/1563\n",
      "Processing batch 939/1563\n",
      "Processing batch 940/1563\n",
      "Processing batch 941/1563\n",
      "Processing batch 942/1563\n",
      "Processing batch 943/1563\n",
      "Processing batch 944/1563\n",
      "Processing batch 945/1563\n",
      "Processing batch 946/1563\n",
      "Processing batch 947/1563\n",
      "Processing batch 948/1563\n",
      "Processing batch 949/1563\n",
      "Processing batch 950/1563\n",
      "Processing batch 951/1563\n",
      "Processing batch 952/1563\n",
      "Processing batch 953/1563\n",
      "Processing batch 954/1563\n",
      "Processing batch 955/1563\n",
      "Processing batch 956/1563\n",
      "Processing batch 957/1563\n",
      "Processing batch 958/1563\n",
      "Processing batch 959/1563\n",
      "Processing batch 960/1563\n",
      "Processing batch 961/1563\n",
      "Processing batch 962/1563\n",
      "Processing batch 963/1563\n",
      "Processing batch 964/1563\n",
      "Processing batch 965/1563\n",
      "Processing batch 966/1563\n",
      "Processing batch 967/1563\n",
      "Processing batch 968/1563\n",
      "Processing batch 969/1563\n",
      "Processing batch 970/1563\n",
      "Processing batch 971/1563\n",
      "Processing batch 972/1563\n",
      "Processing batch 973/1563\n",
      "Processing batch 974/1563\n",
      "Processing batch 975/1563\n",
      "Processing batch 976/1563\n",
      "Processing batch 977/1563\n",
      "Processing batch 978/1563\n",
      "Processing batch 979/1563\n",
      "Processing batch 980/1563\n",
      "Processing batch 981/1563\n",
      "Processing batch 982/1563\n",
      "Processing batch 983/1563\n",
      "Processing batch 984/1563\n",
      "Processing batch 985/1563\n",
      "Processing batch 986/1563\n",
      "Processing batch 987/1563\n",
      "Processing batch 988/1563\n",
      "Processing batch 989/1563\n",
      "Processing batch 990/1563\n",
      "Processing batch 991/1563\n",
      "Processing batch 992/1563\n",
      "Processing batch 993/1563\n",
      "Processing batch 994/1563\n",
      "Processing batch 995/1563\n",
      "Processing batch 996/1563\n",
      "Processing batch 997/1563\n",
      "Processing batch 998/1563\n",
      "Processing batch 999/1563\n",
      "Processing batch 1000/1563\n",
      "Processing batch 1001/1563\n",
      "Processing batch 1002/1563\n",
      "Processing batch 1003/1563\n",
      "Processing batch 1004/1563\n",
      "Processing batch 1005/1563\n",
      "Processing batch 1006/1563\n",
      "Processing batch 1007/1563\n",
      "Processing batch 1008/1563\n",
      "Processing batch 1009/1563\n",
      "Processing batch 1010/1563\n",
      "Processing batch 1011/1563\n",
      "Processing batch 1012/1563\n",
      "Processing batch 1013/1563\n",
      "Processing batch 1014/1563\n",
      "Processing batch 1015/1563\n",
      "Processing batch 1016/1563\n",
      "Processing batch 1017/1563\n",
      "Processing batch 1018/1563\n",
      "Processing batch 1019/1563\n",
      "Processing batch 1020/1563\n",
      "Processing batch 1021/1563\n",
      "Processing batch 1022/1563\n",
      "Processing batch 1023/1563\n",
      "Processing batch 1024/1563\n",
      "Processing batch 1025/1563\n",
      "Processing batch 1026/1563\n",
      "Processing batch 1027/1563\n",
      "Processing batch 1028/1563\n",
      "Processing batch 1029/1563\n",
      "Processing batch 1030/1563\n",
      "Processing batch 1031/1563\n",
      "Processing batch 1032/1563\n",
      "Processing batch 1033/1563\n",
      "Processing batch 1034/1563\n",
      "Processing batch 1035/1563\n",
      "Processing batch 1036/1563\n",
      "Processing batch 1037/1563\n",
      "Processing batch 1038/1563\n",
      "Processing batch 1039/1563\n",
      "Processing batch 1040/1563\n",
      "Processing batch 1041/1563\n",
      "Processing batch 1042/1563\n",
      "Processing batch 1043/1563\n",
      "Processing batch 1044/1563\n",
      "Processing batch 1045/1563\n",
      "Processing batch 1046/1563\n",
      "Processing batch 1047/1563\n",
      "Processing batch 1048/1563\n",
      "Processing batch 1049/1563\n",
      "Processing batch 1050/1563\n",
      "Processing batch 1051/1563\n",
      "Processing batch 1052/1563\n",
      "Processing batch 1053/1563\n",
      "Processing batch 1054/1563\n",
      "Processing batch 1055/1563\n",
      "Processing batch 1056/1563\n",
      "Processing batch 1057/1563\n",
      "Processing batch 1058/1563\n",
      "Processing batch 1059/1563\n",
      "Processing batch 1060/1563\n",
      "Processing batch 1061/1563\n",
      "Processing batch 1062/1563\n",
      "Processing batch 1063/1563\n",
      "Processing batch 1064/1563\n",
      "Processing batch 1065/1563\n",
      "Processing batch 1066/1563\n",
      "Processing batch 1067/1563\n",
      "Processing batch 1068/1563\n",
      "Processing batch 1069/1563\n",
      "Processing batch 1070/1563\n",
      "Processing batch 1071/1563\n",
      "Processing batch 1072/1563\n",
      "Processing batch 1073/1563\n",
      "Processing batch 1074/1563\n",
      "Processing batch 1075/1563\n",
      "Processing batch 1076/1563\n",
      "Processing batch 1077/1563\n",
      "Processing batch 1078/1563\n",
      "Processing batch 1079/1563\n",
      "Processing batch 1080/1563\n",
      "Processing batch 1081/1563\n",
      "Processing batch 1082/1563\n",
      "Processing batch 1083/1563\n",
      "Processing batch 1084/1563\n",
      "Processing batch 1085/1563\n",
      "Processing batch 1086/1563\n",
      "Processing batch 1087/1563\n",
      "Processing batch 1088/1563\n",
      "Processing batch 1089/1563\n",
      "Processing batch 1090/1563\n",
      "Processing batch 1091/1563\n",
      "Processing batch 1092/1563\n",
      "Processing batch 1093/1563\n",
      "Processing batch 1094/1563\n",
      "Processing batch 1095/1563\n",
      "Processing batch 1096/1563\n",
      "Processing batch 1097/1563\n",
      "Processing batch 1098/1563\n",
      "Processing batch 1099/1563\n",
      "Processing batch 1100/1563\n",
      "Processing batch 1101/1563\n",
      "Processing batch 1102/1563\n",
      "Processing batch 1103/1563\n",
      "Processing batch 1104/1563\n",
      "Processing batch 1105/1563\n",
      "Processing batch 1106/1563\n",
      "Processing batch 1107/1563\n",
      "Processing batch 1108/1563\n",
      "Processing batch 1109/1563\n",
      "Processing batch 1110/1563\n",
      "Processing batch 1111/1563\n",
      "Processing batch 1112/1563\n",
      "Processing batch 1113/1563\n",
      "Processing batch 1114/1563\n",
      "Processing batch 1115/1563\n",
      "Processing batch 1116/1563\n",
      "Processing batch 1117/1563\n",
      "Processing batch 1118/1563\n",
      "Processing batch 1119/1563\n",
      "Processing batch 1120/1563\n",
      "Processing batch 1121/1563\n",
      "Processing batch 1122/1563\n",
      "Processing batch 1123/1563\n",
      "Processing batch 1124/1563\n",
      "Processing batch 1125/1563\n",
      "Processing batch 1126/1563\n",
      "Processing batch 1127/1563\n",
      "Processing batch 1128/1563\n",
      "Processing batch 1129/1563\n",
      "Processing batch 1130/1563\n",
      "Processing batch 1131/1563\n",
      "Processing batch 1132/1563\n",
      "Processing batch 1133/1563\n",
      "Processing batch 1134/1563\n",
      "Processing batch 1135/1563\n",
      "Processing batch 1136/1563\n",
      "Processing batch 1137/1563\n",
      "Processing batch 1138/1563\n",
      "Processing batch 1139/1563\n",
      "Processing batch 1140/1563\n",
      "Processing batch 1141/1563\n",
      "Processing batch 1142/1563\n",
      "Processing batch 1143/1563\n",
      "Processing batch 1144/1563\n",
      "Processing batch 1145/1563\n",
      "Processing batch 1146/1563\n",
      "Processing batch 1147/1563\n",
      "Processing batch 1148/1563\n",
      "Processing batch 1149/1563\n",
      "Processing batch 1150/1563\n",
      "Processing batch 1151/1563\n",
      "Processing batch 1152/1563\n",
      "Processing batch 1153/1563\n",
      "Processing batch 1154/1563\n",
      "Processing batch 1155/1563\n",
      "Processing batch 1156/1563\n",
      "Processing batch 1157/1563\n",
      "Processing batch 1158/1563\n",
      "Processing batch 1159/1563\n",
      "Processing batch 1160/1563\n",
      "Processing batch 1161/1563\n",
      "Processing batch 1162/1563\n",
      "Processing batch 1163/1563\n",
      "Processing batch 1164/1563\n",
      "Processing batch 1165/1563\n",
      "Processing batch 1166/1563\n",
      "Processing batch 1167/1563\n",
      "Processing batch 1168/1563\n",
      "Processing batch 1169/1563\n",
      "Processing batch 1170/1563\n",
      "Processing batch 1171/1563\n",
      "Processing batch 1172/1563\n",
      "Processing batch 1173/1563\n",
      "Processing batch 1174/1563\n",
      "Processing batch 1175/1563\n",
      "Processing batch 1176/1563\n",
      "Processing batch 1177/1563\n",
      "Processing batch 1178/1563\n",
      "Processing batch 1179/1563\n",
      "Processing batch 1180/1563\n",
      "Processing batch 1181/1563\n",
      "Processing batch 1182/1563\n",
      "Processing batch 1183/1563\n",
      "Processing batch 1184/1563\n",
      "Processing batch 1185/1563\n",
      "Processing batch 1186/1563\n",
      "Processing batch 1187/1563\n",
      "Processing batch 1188/1563\n",
      "Processing batch 1189/1563\n",
      "Processing batch 1190/1563\n",
      "Processing batch 1191/1563\n",
      "Processing batch 1192/1563\n",
      "Processing batch 1193/1563\n",
      "Processing batch 1194/1563\n",
      "Processing batch 1195/1563\n",
      "Processing batch 1196/1563\n",
      "Processing batch 1197/1563\n",
      "Processing batch 1198/1563\n",
      "Processing batch 1199/1563\n",
      "Processing batch 1200/1563\n",
      "Processing batch 1201/1563\n",
      "Processing batch 1202/1563\n",
      "Processing batch 1203/1563\n",
      "Processing batch 1204/1563\n",
      "Processing batch 1205/1563\n",
      "Processing batch 1206/1563\n",
      "Processing batch 1207/1563\n",
      "Processing batch 1208/1563\n",
      "Processing batch 1209/1563\n",
      "Processing batch 1210/1563\n",
      "Processing batch 1211/1563\n",
      "Processing batch 1212/1563\n",
      "Processing batch 1213/1563\n",
      "Processing batch 1214/1563\n",
      "Processing batch 1215/1563\n",
      "Processing batch 1216/1563\n",
      "Processing batch 1217/1563\n",
      "Processing batch 1218/1563\n",
      "Processing batch 1219/1563\n",
      "Processing batch 1220/1563\n",
      "Processing batch 1221/1563\n",
      "Processing batch 1222/1563\n",
      "Processing batch 1223/1563\n",
      "Processing batch 1224/1563\n",
      "Processing batch 1225/1563\n",
      "Processing batch 1226/1563\n",
      "Processing batch 1227/1563\n",
      "Processing batch 1228/1563\n",
      "Processing batch 1229/1563\n",
      "Processing batch 1230/1563\n",
      "Processing batch 1231/1563\n",
      "Processing batch 1232/1563\n",
      "Processing batch 1233/1563\n",
      "Processing batch 1234/1563\n",
      "Processing batch 1235/1563\n",
      "Processing batch 1236/1563\n",
      "Processing batch 1237/1563\n",
      "Processing batch 1238/1563\n",
      "Processing batch 1239/1563\n",
      "Processing batch 1240/1563\n",
      "Processing batch 1241/1563\n",
      "Processing batch 1242/1563\n",
      "Processing batch 1243/1563\n",
      "Processing batch 1244/1563\n",
      "Processing batch 1245/1563\n",
      "Processing batch 1246/1563\n",
      "Processing batch 1247/1563\n",
      "Processing batch 1248/1563\n",
      "Processing batch 1249/1563\n",
      "Processing batch 1250/1563\n",
      "Processing batch 1251/1563\n",
      "Processing batch 1252/1563\n",
      "Processing batch 1253/1563\n",
      "Processing batch 1254/1563\n",
      "Processing batch 1255/1563\n",
      "Processing batch 1256/1563\n",
      "Processing batch 1257/1563\n",
      "Processing batch 1258/1563\n",
      "Processing batch 1259/1563\n",
      "Processing batch 1260/1563\n",
      "Processing batch 1261/1563\n",
      "Processing batch 1262/1563\n",
      "Processing batch 1263/1563\n",
      "Processing batch 1264/1563\n",
      "Processing batch 1265/1563\n",
      "Processing batch 1266/1563\n",
      "Processing batch 1267/1563\n",
      "Processing batch 1268/1563\n",
      "Processing batch 1269/1563\n",
      "Processing batch 1270/1563\n",
      "Processing batch 1271/1563\n",
      "Processing batch 1272/1563\n",
      "Processing batch 1273/1563\n",
      "Processing batch 1274/1563\n",
      "Processing batch 1275/1563\n",
      "Processing batch 1276/1563\n",
      "Processing batch 1277/1563\n",
      "Processing batch 1278/1563\n",
      "Processing batch 1279/1563\n",
      "Processing batch 1280/1563\n",
      "Processing batch 1281/1563\n",
      "Processing batch 1282/1563\n",
      "Processing batch 1283/1563\n",
      "Processing batch 1284/1563\n",
      "Processing batch 1285/1563\n",
      "Processing batch 1286/1563\n",
      "Processing batch 1287/1563\n",
      "Processing batch 1288/1563\n",
      "Processing batch 1289/1563\n",
      "Processing batch 1290/1563\n",
      "Processing batch 1291/1563\n",
      "Processing batch 1292/1563\n",
      "Processing batch 1293/1563\n",
      "Processing batch 1294/1563\n",
      "Processing batch 1295/1563\n",
      "Processing batch 1296/1563\n",
      "Processing batch 1297/1563\n",
      "Processing batch 1298/1563\n",
      "Processing batch 1299/1563\n",
      "Processing batch 1300/1563\n",
      "Processing batch 1301/1563\n",
      "Processing batch 1302/1563\n",
      "Processing batch 1303/1563\n",
      "Processing batch 1304/1563\n",
      "Processing batch 1305/1563\n",
      "Processing batch 1306/1563\n",
      "Processing batch 1307/1563\n",
      "Processing batch 1308/1563\n",
      "Processing batch 1309/1563\n",
      "Processing batch 1310/1563\n",
      "Processing batch 1311/1563\n",
      "Processing batch 1312/1563\n",
      "Processing batch 1313/1563\n",
      "Processing batch 1314/1563\n",
      "Processing batch 1315/1563\n",
      "Processing batch 1316/1563\n",
      "Processing batch 1317/1563\n",
      "Processing batch 1318/1563\n",
      "Processing batch 1319/1563\n",
      "Processing batch 1320/1563\n",
      "Processing batch 1321/1563\n",
      "Processing batch 1322/1563\n",
      "Processing batch 1323/1563\n",
      "Processing batch 1324/1563\n",
      "Processing batch 1325/1563\n",
      "Processing batch 1326/1563\n",
      "Processing batch 1327/1563\n",
      "Processing batch 1328/1563\n",
      "Processing batch 1329/1563\n",
      "Processing batch 1330/1563\n",
      "Processing batch 1331/1563\n",
      "Processing batch 1332/1563\n",
      "Processing batch 1333/1563\n",
      "Processing batch 1334/1563\n",
      "Processing batch 1335/1563\n",
      "Processing batch 1336/1563\n",
      "Processing batch 1337/1563\n",
      "Processing batch 1338/1563\n",
      "Processing batch 1339/1563\n",
      "Processing batch 1340/1563\n",
      "Processing batch 1341/1563\n",
      "Processing batch 1342/1563\n",
      "Processing batch 1343/1563\n",
      "Processing batch 1344/1563\n",
      "Processing batch 1345/1563\n",
      "Processing batch 1346/1563\n",
      "Processing batch 1347/1563\n",
      "Processing batch 1348/1563\n",
      "Processing batch 1349/1563\n",
      "Processing batch 1350/1563\n",
      "Processing batch 1351/1563\n",
      "Processing batch 1352/1563\n",
      "Processing batch 1353/1563\n",
      "Processing batch 1354/1563\n",
      "Processing batch 1355/1563\n",
      "Processing batch 1356/1563\n",
      "Processing batch 1357/1563\n",
      "Processing batch 1358/1563\n",
      "Processing batch 1359/1563\n",
      "Processing batch 1360/1563\n",
      "Processing batch 1361/1563\n",
      "Processing batch 1362/1563\n",
      "Processing batch 1363/1563\n",
      "Processing batch 1364/1563\n",
      "Processing batch 1365/1563\n",
      "Processing batch 1366/1563\n",
      "Processing batch 1367/1563\n",
      "Processing batch 1368/1563\n",
      "Processing batch 1369/1563\n",
      "Processing batch 1370/1563\n",
      "Processing batch 1371/1563\n",
      "Processing batch 1372/1563\n",
      "Processing batch 1373/1563\n",
      "Processing batch 1374/1563\n",
      "Processing batch 1375/1563\n",
      "Processing batch 1376/1563\n",
      "Processing batch 1377/1563\n",
      "Processing batch 1378/1563\n",
      "Processing batch 1379/1563\n",
      "Processing batch 1380/1563\n",
      "Processing batch 1381/1563\n",
      "Processing batch 1382/1563\n",
      "Processing batch 1383/1563\n",
      "Processing batch 1384/1563\n",
      "Processing batch 1385/1563\n",
      "Processing batch 1386/1563\n",
      "Processing batch 1387/1563\n",
      "Processing batch 1388/1563\n",
      "Processing batch 1389/1563\n",
      "Processing batch 1390/1563\n",
      "Processing batch 1391/1563\n",
      "Processing batch 1392/1563\n",
      "Processing batch 1393/1563\n",
      "Processing batch 1394/1563\n",
      "Processing batch 1395/1563\n",
      "Processing batch 1396/1563\n",
      "Processing batch 1397/1563\n",
      "Processing batch 1398/1563\n",
      "Processing batch 1399/1563\n",
      "Processing batch 1400/1563\n",
      "Processing batch 1401/1563\n",
      "Processing batch 1402/1563\n",
      "Processing batch 1403/1563\n",
      "Processing batch 1404/1563\n",
      "Processing batch 1405/1563\n",
      "Processing batch 1406/1563\n",
      "Processing batch 1407/1563\n",
      "Processing batch 1408/1563\n",
      "Processing batch 1409/1563\n",
      "Processing batch 1410/1563\n",
      "Processing batch 1411/1563\n",
      "Processing batch 1412/1563\n",
      "Processing batch 1413/1563\n",
      "Processing batch 1414/1563\n",
      "Processing batch 1415/1563\n",
      "Processing batch 1416/1563\n",
      "Processing batch 1417/1563\n",
      "Processing batch 1418/1563\n",
      "Processing batch 1419/1563\n",
      "Processing batch 1420/1563\n",
      "Processing batch 1421/1563\n",
      "Processing batch 1422/1563\n",
      "Processing batch 1423/1563\n",
      "Processing batch 1424/1563\n",
      "Processing batch 1425/1563\n",
      "Processing batch 1426/1563\n",
      "Processing batch 1427/1563\n",
      "Processing batch 1428/1563\n",
      "Processing batch 1429/1563\n",
      "Processing batch 1430/1563\n",
      "Processing batch 1431/1563\n",
      "Processing batch 1432/1563\n",
      "Processing batch 1433/1563\n",
      "Processing batch 1434/1563\n",
      "Processing batch 1435/1563\n",
      "Processing batch 1436/1563\n",
      "Processing batch 1437/1563\n",
      "Processing batch 1438/1563\n",
      "Processing batch 1439/1563\n",
      "Processing batch 1440/1563\n",
      "Processing batch 1441/1563\n",
      "Processing batch 1442/1563\n",
      "Processing batch 1443/1563\n",
      "Processing batch 1444/1563\n",
      "Processing batch 1445/1563\n",
      "Processing batch 1446/1563\n",
      "Processing batch 1447/1563\n",
      "Processing batch 1448/1563\n",
      "Processing batch 1449/1563\n",
      "Processing batch 1450/1563\n",
      "Processing batch 1451/1563\n",
      "Processing batch 1452/1563\n",
      "Processing batch 1453/1563\n",
      "Processing batch 1454/1563\n",
      "Processing batch 1455/1563\n",
      "Processing batch 1456/1563\n",
      "Processing batch 1457/1563\n",
      "Processing batch 1458/1563\n",
      "Processing batch 1459/1563\n",
      "Processing batch 1460/1563\n",
      "Processing batch 1461/1563\n",
      "Processing batch 1462/1563\n",
      "Processing batch 1463/1563\n",
      "Processing batch 1464/1563\n",
      "Processing batch 1465/1563\n",
      "Processing batch 1466/1563\n",
      "Processing batch 1467/1563\n",
      "Processing batch 1468/1563\n",
      "Processing batch 1469/1563\n",
      "Processing batch 1470/1563\n",
      "Processing batch 1471/1563\n",
      "Processing batch 1472/1563\n",
      "Processing batch 1473/1563\n",
      "Processing batch 1474/1563\n",
      "Processing batch 1475/1563\n",
      "Processing batch 1476/1563\n",
      "Processing batch 1477/1563\n",
      "Processing batch 1478/1563\n",
      "Processing batch 1479/1563\n",
      "Processing batch 1480/1563\n",
      "Processing batch 1481/1563\n",
      "Processing batch 1482/1563\n",
      "Processing batch 1483/1563\n",
      "Processing batch 1484/1563\n",
      "Processing batch 1485/1563\n",
      "Processing batch 1486/1563\n",
      "Processing batch 1487/1563\n",
      "Processing batch 1488/1563\n",
      "Processing batch 1489/1563\n",
      "Processing batch 1490/1563\n",
      "Processing batch 1491/1563\n",
      "Processing batch 1492/1563\n",
      "Processing batch 1493/1563\n",
      "Processing batch 1494/1563\n",
      "Processing batch 1495/1563\n",
      "Processing batch 1496/1563\n",
      "Processing batch 1497/1563\n",
      "Processing batch 1498/1563\n",
      "Processing batch 1499/1563\n",
      "Processing batch 1500/1563\n",
      "Processing batch 1501/1563\n",
      "Processing batch 1502/1563\n",
      "Processing batch 1503/1563\n",
      "Processing batch 1504/1563\n",
      "Processing batch 1505/1563\n",
      "Processing batch 1506/1563\n",
      "Processing batch 1507/1563\n",
      "Processing batch 1508/1563\n",
      "Processing batch 1509/1563\n",
      "Processing batch 1510/1563\n",
      "Processing batch 1511/1563\n",
      "Processing batch 1512/1563\n",
      "Processing batch 1513/1563\n",
      "Processing batch 1514/1563\n",
      "Processing batch 1515/1563\n",
      "Processing batch 1516/1563\n",
      "Processing batch 1517/1563\n",
      "Processing batch 1518/1563\n",
      "Processing batch 1519/1563\n",
      "Processing batch 1520/1563\n",
      "Processing batch 1521/1563\n",
      "Processing batch 1522/1563\n",
      "Processing batch 1523/1563\n",
      "Processing batch 1524/1563\n",
      "Processing batch 1525/1563\n",
      "Processing batch 1526/1563\n",
      "Processing batch 1527/1563\n",
      "Processing batch 1528/1563\n",
      "Processing batch 1529/1563\n",
      "Processing batch 1530/1563\n",
      "Processing batch 1531/1563\n",
      "Processing batch 1532/1563\n",
      "Processing batch 1533/1563\n",
      "Processing batch 1534/1563\n",
      "Processing batch 1535/1563\n",
      "Processing batch 1536/1563\n",
      "Processing batch 1537/1563\n",
      "Processing batch 1538/1563\n",
      "Processing batch 1539/1563\n",
      "Processing batch 1540/1563\n",
      "Processing batch 1541/1563\n",
      "Processing batch 1542/1563\n",
      "Processing batch 1543/1563\n",
      "Processing batch 1544/1563\n",
      "Processing batch 1545/1563\n",
      "Processing batch 1546/1563\n",
      "Processing batch 1547/1563\n",
      "Processing batch 1548/1563\n",
      "Processing batch 1549/1563\n",
      "Processing batch 1550/1563\n",
      "Processing batch 1551/1563\n",
      "Processing batch 1552/1563\n",
      "Processing batch 1553/1563\n",
      "Processing batch 1554/1563\n",
      "Processing batch 1555/1563\n",
      "Processing batch 1556/1563\n",
      "Processing batch 1557/1563\n",
      "Processing batch 1558/1563\n",
      "Processing batch 1559/1563\n",
      "Processing batch 1560/1563\n",
      "Processing batch 1561/1563\n",
      "Processing batch 1562/1563\n",
      "Processing batch 1563/1563\n"
     ]
    }
   ],
   "source": [
    "import math\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Define the batch size so that it fits on your GPU. You can also do the processing on the CPU, but it will be slower.\n",
    "batch_size = 16\n",
    "\n",
    "# Path where the feature vectors will be stored\n",
    "features_path = Path(\"unsplash-dataset\") / dataset_version / \"features\"\n",
    "\n",
    "# Compute how many batches are needed\n",
    "batches = math.ceil(len(photos_files) / batch_size)\n",
    "\n",
    "# Process each batch\n",
    "for i in range(batches):\n",
    "    print(f\"Processing batch {i+1}/{batches}\")\n",
    "\n",
    "    batch_ids_path = features_path / f\"{i:010d}.csv\"\n",
    "    batch_features_path = features_path / f\"{i:010d}.npy\"\n",
    "    \n",
    "    # Only do the processing if the batch wasn't processed yet\n",
    "    if not batch_features_path.exists():\n",
    "        try:\n",
    "            # Select the photos for the current batch\n",
    "            batch_files = photos_files[i*batch_size : (i+1)*batch_size]\n",
    "\n",
    "            # Compute the features and save to a numpy file\n",
    "            batch_features = compute_clip_features(batch_files)\n",
    "            np.save(batch_features_path, batch_features)\n",
    "\n",
    "            # Save the photo IDs to a CSV file\n",
    "            photo_ids = [photo_file.name.split(\".\")[0] for photo_file in batch_files]\n",
    "            photo_ids_data = pd.DataFrame(photo_ids, columns=['photo_id'])\n",
    "            photo_ids_data.to_csv(batch_ids_path, index=False)\n",
    "        except:\n",
    "            # Catch problems with the processing to make the process more robust\n",
    "            print(f'Problem with batch {i}')"
   ]
  },
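  {
   "source": [
    "Because each batch is written as two files (`.npy` first, then `.csv`), a failure between the two writes could leave features without matching photo IDs, and the merged files would then be misaligned. The sketch below is one possible sanity check, not part of the original pipeline: `find_misaligned_batches` is a hypothetical helper, and the demo runs on a temporary folder with made-up data rather than on the real features folder."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical helper (not part of the original notebook): list the batches whose\n",
    "# .npy file has no matching .csv file, or whose two files differ in row count\n",
    "def find_misaligned_batches(folder):\n",
    "    problems = []\n",
    "    for npy_file in sorted(Path(folder).glob(\"[0-9]*.npy\")):\n",
    "        csv_file = npy_file.with_suffix(\".csv\")\n",
    "        if not csv_file.exists():\n",
    "            problems.append(npy_file.stem)\n",
    "        elif len(np.load(npy_file)) != len(pd.read_csv(csv_file)):\n",
    "            problems.append(npy_file.stem)\n",
    "    return problems\n",
    "\n",
    "# Demo on a temporary folder with one consistent and one incomplete batch\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    tmp = Path(tmp)\n",
    "    np.save(tmp / \"0000000000.npy\", np.zeros((2, 512), dtype=np.float32))\n",
    "    pd.DataFrame({\"photo_id\": [\"a\", \"b\"]}).to_csv(tmp / \"0000000000.csv\", index=False)\n",
    "    np.save(tmp / \"0000000001.npy\", np.zeros((2, 512), dtype=np.float32))  # no CSV written\n",
    "    print(find_misaligned_batches(tmp))"
   ]
  },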
  {
   "source": [
    "Merge the features and the photo IDs. The resulting files are `features.npy` and `photo_ids.csv`. Feel free to delete the intermediate results."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Load all numpy files\n",
    "features_list = [np.load(features_file) for features_file in sorted(features_path.glob(\"*.npy\"))]\n",
    "\n",
    "# Concatenate the features and store in a merged file\n",
    "features = np.concatenate(features_list)\n",
    "np.save(features_path / \"features.npy\", features)\n",
    "\n",
    "# Load all the photo IDs\n",
    "photo_ids = pd.concat([pd.read_csv(ids_file) for ids_file in sorted(features_path.glob(\"*.csv\"))])\n",
    "photo_ids.to_csv(features_path / \"photo_ids.csv\", index=False)"
   ]
  }
 ]
}